Scientific Data
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Scientific Data's content profile, based on 174 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.
Niittynen, P.; Kemppinen, J.
We present FennoTraits, a dataset of plant functional trait and community composition data collected across northern Finland, Norway, and Sweden in 2016-2025. The dataset contains 42 049 abundance estimations and 155 794 functional trait observations of 10 traits, representing 373 vascular plant species collected from 1 235 study sites within seven study areas. The trait measurements consist of size-structural, leaf economic, leaf spectral, and reproductive traits. The species represent the majority of the native vascular plant species that occur at the seven study areas, and many of the species occur in all seven areas across the two biomes and their ecotone: tundra and boreal forests. Each study area has distinct characteristics and a range of habitats: tundra, meadows, wetlands, shrublands, and boreal forests. These areas are under low anthropogenic influence, and many of the sites are within protected areas that are reserved for nature conservation and scientific research. Finally, alongside the dataset we provide a general description of the main trait patterns and profiles of the northern European flora.
Wu, J.; Perandini, L.; Batra, T.; Igoshin, S.; Bari, S.; de Araujo, A. L.; Willemink, M. J.
Digital breast tomosynthesis (DBT) is a powerful imaging modality that allows for improved lesion visibility, characterization, and localization compared to conventional two-dimensional digital mammography. DBT has been increasingly adopted in screening and diagnostic settings globally, particularly for women with dense breast tissue where tissue overlap presents a significant diagnostic challenge. Here we describe DBT-2026, a real-world imaging dataset with 558 DBT exams from 558 patients with Breast Imaging Reporting and Data System (BI-RADS) scores of 0, 1, or 2. Each case contains one DBT examination in combination with expert annotations and free-text radiology reports that describe the radiological findings, produced in routine clinical practice. To protect patient privacy, all images and reports have been de-identified. The dataset is made freely available to researchers for non-commercial projects to facilitate and encourage research in breast cancer imaging.
Milne, L.; Simpson, C. G.; Guo, W.; Mayer, C.-D.; Milne, I.; Bayer, M.
We describe a major new release of the EoRNA database, a gene expression database for barley based on public data, first published in 2021. EoRNA v.2 (https://ics.hutton.ac.uk/eorna2/index.html) features an order of magnitude more samples and is based on a new automated workflow of sample discovery and processing, which has enabled a dramatic scale-up of the original database. EoRNA v.2 also features a major rebuild of the web user interface with rich new functionality. All infrastructure-related code, database schemas, and web components are now species agnostic and publicly available for reuse with other taxa. A dedicated new reference transcript dataset has been created for EoRNA v.2, which is largely based on the recently published barley pan-transcriptome and represents the most comprehensive dataset of its kind to date.
Uiterwaal, S. F.; La Sorte, F. A.; Coblentz, K. E.; DeLong, J. P.
Motivation: The diet composition of a predator is a direct reflection of its role in a food web, resulting from interactions with prey species. Raptors (including hawks, owls, and falcons) are ubiquitous predators with diverse diets, yet there is no comprehensive database of raptor diet composition. We present a database of over 3500 raw raptor diet records, compiled from more than 1000 studies and representing 173 raptor species from across the world. Our dataset complements existing qualitative summaries of species diets by compiling thousands of quantitative diet "samples" over time and space to present diet data at a uniquely fine resolution. Main types of variable contained: The database comprises published records of raptor diets from pellets, prey remains, direct or photographic observations, prey DNA, and raptor gut or gullet contents. For each diet, we present the taxonomic identity and amounts of consumed prey. We additionally present various metadata for each diet such as location, habitat, and season. Spatial location and grain: The study incorporates diet records collected worldwide, with each record assigned geographic coordinates corresponding to the location where the diet information was obtained. Time period and grain: The database includes diet records from 1893 to 2025. We report a year for each diet record. Major taxa and level of measurement: We recorded raptor diet at the species level, including raptors from three orders: Strigiformes, Falconiformes and Accipitriformes excluding vultures. Most prey are identified to species, but prey taxonomic level varies depending on the extent to which they could be identified. Software format: Diet records and metadata are provided in two files with comma-separated value (.csv) format.
Friedrich, H.; Sahin, A. I.; Rajamani, N.; ALHO, E. J. L.; Milanese, V.; Oxenford, S.; Brammerloh, M.; Goede, L.; Matthies, C.; Volkmann, J.; Meyer, G.; Kirilina, E.; Pijar, J.; Horisawa, S.; Howard, C.; Garimella, A.; Li, N.; Fox, M. D.; Friedrich, M.; Bochtler, A.; Edlow, B. L.; Neudorfer, C.; Horn, A.
Clinical interventions and neuroimaging in the subcortex require anatomical definitions that exceed the resolution and anatomical detail of currently available deformable brain atlases. Here, we introduce a high-resolution human brain atlas comprising 95 manually segmented grey and white matter structures as well as 82 white matter tracts compiled from a multitude of resources including ex-vivo MRI, histology, fibre dissections, and neuroanatomy textbooks. The atlas is defined at an isotropic resolution of 100 µm and can be precisely deformed to individual subject brain anatomy. By providing precise definitions of both grey and white matter structures within and around the basal ganglia, thalamus, subthalamus, midbrain and cerebellum, the atlas provides a foundational resource for stereotactic surgery and subcortical brain imaging research, as well as for development of next-generation neuromodulation strategies.
Wolters, F. C.; Woldu Semere, T.; Schranz, M. E.; Medema, M. H.; Bouwmeester, K.; van der Hooft, J. J. J.
Plants produce the most diverse blends of specialized metabolites on earth. Natural products derived from plants are valuable resources for drug development, food chemistry, and crop resistance breeding. Phenotypes of specialized metabolite profiles can be captured by untargeted mass spectrometry across species phylogeny, tissues, and genotypes. Here, we collected metabolic fingerprints of 17 Brassicaceae species across three tissues (paired leaf and root; flower) using liquid chromatography-tandem mass spectrometry (LC-MS/MS) in positive and negative ionization mode. Corresponding metadata has been refined for reuse according to ReDU guidelines, and for integration with public genomic and transcriptomic data. Standardization of in vitro growth conditions and data processing workflows enables integration of acquired raw and processed data across platforms for single- and multi-omics analysis. Further, the inclusion of tissue-specific metabolic profiles across ploidy levels, as well as across crop species and wild relatives, makes this dataset a valuable resource for natural product discovery.
Haueise, T.; Machann, J.
Chemical shift-encoded magnetic resonance imaging using high-resolution 3D Dixon techniques enables the non-invasive and radiation-free assessment of whole-body adipose tissue and ectopic fat distribution. Automatic deep learning-based segmentation of metabolically relevant adipose tissue compartments and ectopic fat deposits in parenchymal tissue is the most important image processing step for the quantification of adipose tissue volumes and ectopic fat percentages from whole-body imaging. This work presents a segmentation model dedicated to the segmentation of 19 metabolically relevant adipose tissue compartments and ectopic fat deposits from whole-body Dixon MRI. The trained segmentation model is available upon request. Related post-processing routines to compute volumes and fat percentages are publicly available: https://github.com/tobihaui/WholeBodyATQuantification.
Fleure, V.; Villeger, S.; Claverie, T.
Monitoring fish communities is essential for understanding biodiversity dynamics and coral reef ecosystem health. Underwater imaging provides a non-invasive and repeatable approach for such monitoring, yet analysis of large volumes of video data remains extremely time-consuming for experts. Resolving this bottleneck is now within reach, but automated fish identification requires large, high-quality, labelled image datasets for training and testing reliable deep learning models. However, to date, no such dataset exists for the Western Indian Ocean (WIO), a global biodiversity hotspot hosting more than 300 common non-cryptobenthic fish species and facing increasing anthropogenic pressures. This paper presents a novel and publicly available dataset of 114,664 images annotated from 186 videos recorded using fixed underwater cameras on shallow reef habitats in the Mayotte archipelago. All images were labelled and validated by trained marine biologists following a standardized protocol. Each image includes detailed metadata describing recording conditions. The dataset comprises 124 reef fish species (including 110 with >200 images) and 8 background classes. This dataset will allow training and testing of automated fish classification models.
Shareef-Trudeau, L.; Braun, D.; Bounyarith, T.; Li, J.; Peng, H.; Kucyi, A.
The integration of electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) can be used to characterize temporal and spatial components of neural activity during unfolding mental experience. Here we introduce a multi-session simultaneous EEG-fMRI dataset with measures of continuous behavior and spontaneous mental experience. Data components, organized in Brain Imaging Data Structure (BIDS) format, include fMRI, EEG with carbon wire loop sensors for artifact removal, continuous performance task responses, experience sampling ratings, and mental health surveys, from 24 healthy adults. Tasks included the gradual-onset continuous performance task and resting state with intermittent experience sampling of 13 unique thought dimensions (36 repetitions, for a total of 468 ratings, per participant). The same protocol was completed on two different days, yielding approximately 1.33 hours of simultaneous EEG-fMRI data per individual. The dataset may be used to explore the behavioral and experiential relevance of brain activity during the wakeful resting state. The dataset also provides a means to study the reliability of relationships between fMRI and EEG features across sessions within individuals.
Kambara, K.; Chen, Q.; Tsugama, D.
Grass Expression Atlas (GExA) is an interactive web-based resource for rapid exploration of gene expression across diverse tissues, developmental stages, and conditions in grass species. GExA integrates publicly available RNA sequencing (RNA-seq) datasets for four millets: pearl millet (Cenchrus americanus), foxtail millet (Setaria italica), proso millet (Panicum miliaceum), and finger millet (Eleusine coracana), and includes barley (Hordeum vulgare) and sorghum (Sorghum bicolor) as reference species. Datasets were processed using a unified processing workflow to generate expression values in transcripts per million (TPM). The current release comprises 4,673 samples from 442 BioProjects, including 987 pearl millet samples and 2,216 foxtail millet samples, and is provided through a user-friendly web interface. GExA is designed for scalable expansion to additional species via the pipeline used in this study. GExA is freely available at https://webpark2116.sakura.ne.jp/RNADB.
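As a rough illustration of the TPM unit mentioned in the GExA abstract above, the sketch below applies the standard transcripts-per-million definition: divide each read count by transcript length in kilobases, then rescale so the values sum to one million. The function name `tpm` and its inputs are hypothetical; the abstract does not describe GExA's actual workflow, which may use effective lengths or a dedicated quantification tool.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts     -- read counts per transcript (hypothetical input)
    lengths_bp -- transcript lengths in base pairs, same order as counts
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    if any(length <= 0 for length in lengths_bp):
        raise ValueError("transcript lengths must be positive")

    # Step 1: length-normalise each count (reads per kilobase).
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]

    # Step 2: rescale so that all values in the sample sum to one million.
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1_000_000 for rate in rates]
```

Because every sample is rescaled to the same total, TPM values are comparable across transcripts within a sample, which is one reason the unit is common in expression atlases.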
Madan, R.; Crane, P. K.; Gennari, J. H.; Latimer, C. S.; Choi, S.-E.; Grabowski, T. J.; Mac Donald, C. L.; Hunt, D.; Postupna, N.; Bajwa, T.; Webster, J.
Quantitative neuropathology has advanced through whole-slide imaging and digital histology platforms. Yet, these measurements rarely align with neuroimaging coordinate frameworks that may be useful for spatial modeling and other applications. QNPtoVox, short for quantitative neuropathology to voxels, is a reproducible, modular pipeline that transforms quantitative metrics generated by digital pathology software (HALO) into voxel-based maps registered to a standard common coordinate (MNI) template. The workflow integrates digital histopathology, gross tissue photography, ex-vivo MRI, and nonlinear registration to generate spatially standardized 3D pathology representations. This Methods article provides a complete procedural description, including required materials, step-wise instructions, operator-dependent checkpoints, expected outputs, reproducibility evaluation, and troubleshooting. QNPtoVox enables voxel-level integration of neuropathology with neuroimaging tools, unlocking existing histopathology datasets for computational modeling and cross-cohort harmonization.
Timm, L. E.; Hsieh, Y.; Lopez, J. A.; Almgren, S. A.; Glass, J. R.
Pacific herring (Clupea pallasii) serve as a critical trophic link between plankton and many marine species targeted by fisheries. With a broad distribution throughout the North Pacific Ocean, from the Arctic to temperate latitudes, herring hold ecological, economic, and cultural importance. Despite this importance, genomic resources for this species, such as reference genome sequences, have only recently become available. To date, only one scaffold-level reference genome, representing a specimen from the Gulf of Alaska (Vancouver; 1,379 scaffolds), has been published to NCBI. Addressing this data gap, we produced a high-quality 795 Mb genome sequence organized into 26 chromosomes by combining long-read sequencing with short-read sequencing of proximity ligation libraries. Our assembly is highly complete (BUSCO score of 97.7%) and contiguous (922 contigs, N50 = 7,338,470, L50 = 38; 26 scaffolds, N50 = 31,494,017, L50 = 12). Pacific herring south of the Aleutian Islands and the Alaska Peninsula are genetically differentiated from those in the Bering Sea, making a reference genome from the eastern Bering Sea an important addition to the Pacific herring genomic toolbox.
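For readers unfamiliar with the contiguity statistics quoted in the herring abstract above, the following minimal sketch computes N50 and L50 under their usual definitions: sort sequences from longest to shortest and find the point at which the running total first reaches half of the assembly size. The function name and the example lengths are hypothetical and are not taken from the assembly itself.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the sequence at which the running total, taken
    from longest to shortest, first reaches half of the total length.
    L50 is the number of sequences needed to reach that point.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")

    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Hypothetical toy assembly: total 300, half 150, reached at the second sequence.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # -> (70, 2)
```

A larger N50 together with a smaller L50 indicates that most of the assembly sits in a few long sequences, which is the sense in which the abstract calls the assembly contiguous.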
Ciraolo, A.; Scalona, E.; Zilli, A.; Nuara, A.; De Marco, D.; Rizzolatti, G.; Adamo, P.; Gatti, R.; Rossi, P.; Banfi, S.; Rocca, M. A.; Filippi, M.; Avanzini, P.; Fabbri-Destro, M.
The recovery of motor function is increasingly understood as a process influenced not only by physical training but also by perceptual and cognitive strategies. Action Observation Treatment (AOT), a neurorehabilitation approach in which patients observe goal-directed motor actions before executing them, has demonstrated clinical benefits; however, its wider implementation is hindered by a lack of standardized procedures. We present an open-access dataset of 33 upper limb gestures specifically developed to support the administration of Virtual Reality-based AOT (VR-AOT). The gestures were selected in collaboration with expert physiotherapists to ensure clinical relevance, and are provided as motion capture recordings along with Unity-based 3D animations embedded in configurable virtual scenes. The dataset is designed for flexibility, allowing users to modify parameters such as viewpoint, laterality, and repetition count. Technical validation confirms its usability and therapeutic applicability across multiple clinical and research contexts. This dataset offers a standardized yet customizable resource for developing and comparing VR-AOT protocols, with potential applications in neurorehabilitation and motor learning research.
Laskowski, L. F.; Gruys, M. L.; Huber, R.; DiGeronimo, A.; Arsham, A. M.; Chandrasekaran, V.; Rele, C. P.; Boies, L.
Gene Model for Insulin-like peptide 4 (Ilp4) in the D. simulans DsimGB2 assembly (GCA_000754195.3). The characterization of this ortholog was carried out as part of a larger, ongoing dataset designed to explore the evolution of the insulin/insulin-like growth factor signaling (IIS) pathway across the genus Drosophila, utilizing the Genomics Education Partnership gene annotation protocol within Course-based Undergraduate Research Experiences.
Niittynen, P.; Heikkinen, R. K.; Hällfors, M. H.; Määttänen, A.-M.; Norros, V.; Kemppinen, J.
The NordicTraits dataset provides the first comprehensive, imputed, and openly available species-level functional trait resource for all native vascular plants across Denmark, Finland, Iceland, Norway, and Sweden. Functional traits such as plant height, seed mass, and leaf nitrogen content are critical for understanding plant strategies and ecosystem processes, including the services they provide to human society, and for predicting biodiversity responses to environmental change. The Nordic region has a rich botanical history. However, the absence of a unified trait database has limited trait-based ecological research in this region, which is undergoing rapid climate change. To address this gap, we compiled and harmonized trait data from major global databases and regional sources, covering 3,099 vascular plant species. In total, we used 205 traits in the imputation model, with the source data covering, on average, 54% (5-81%) of the species. We employed rigorous data cleaning, taxonomic standardization, and a Random Forest-based imputation framework to fill the missing values, while incorporating phylogenetic information to improve accuracy. The final dataset includes 44 selected key functional traits with no missing values, including both continuous and categorical traits, enabling robust analyses of plant strategies and responses to environmental gradients across the region's diverse temperate, boreal, arctic, and alpine ecosystems. The dataset is particularly valuable for large-scale, multi-species studies, and those focusing on functional community assessments across a wide range of vegetation types. NordicTraits facilitates the paradigm shift from species-based to trait-based ecology, supporting research on biodiversity, conservation, and climate change impact predictions in northern Europe.
Manfrini, E.; Sauvion, N.; Maquart, P.-O.; Legal, L.; Blight, O.; Duquesne, E.; Hanot, C.; Bang, A.; Geslin, B.; Goebel, F.-R.; Fournier, D.; Berggren, A.; Javal, M.; Angulo, E.; Pincebourde, S.; Zakardjian, M.; Renault, D.; Le Lann, C.; Derocles, S.; Vayssieres, J.-F.; Leroy, B.; Courchamp, F.
Insect research remains hindered by limited data availability and fragmented knowledge compared to other, better-documented taxonomic groups. Increasingly, both the macroecological and the insect research communities highlight the need to integrate large-scale ecological trait datasets for insects. We present AnthropInsect, the largest database on insect traits to date, which uniquely includes variables describing human-insect associations. AnthropInsect describes species through 35 variables grouped into five categories: (i) taxonomic descriptors; (ii) ecological descriptors (native bioregions and habitat); (iii) human-insect associations (edibility and invasive status); (iv) functional traits (behavior, morphology, life history and feeding); and (v) macroecological descriptors of native-range geography and climate. AnthropInsect currently includes 5,870 species across six major orders: Coleoptera, Lepidoptera, Hemiptera, Hymenoptera, Orthoptera and Blattodea. Data extracted from peer-reviewed and grey literature and from existing databases were standardized and curated with expert knowledge to ensure accuracy. By providing trait data with information on insect-human interactions, this rigorously curated resource supports global research in entomology, ecology, conservation, and global change.
Del Vecchio, A.; Enoka, R. M.
The scientific literature on human motor units and electromyography (EMG) spans over a century (1925-2025), comprising research impossible to synthesize manually. We introduce NeuromechaniX, a domain-specific platform for automated extraction and meta-analysis of this literature. The core component, MUscraper, is a large language model pipeline that extracts approximately 200 structured metadata fields, organized into 17 major sections spanning participant demographics, EMG acquisition parameters, muscle identification, task protocols, decomposition methods, and motor-unit outcomes, from ~2,000 publications on human limb muscles. This automated extraction transforms heterogeneous narrative reports into a standardized, queryable database at a scale not achievable through manual review. From this dataset, we analyzed motor-unit discharge rate across 208 studies examining seven muscles. Our analyses reveal that discharge rates differ significantly among muscles (p<0.001), with biceps brachii exhibiting the highest rates (15.9 pps), followed by first dorsal interosseous (13.7 pps) and tibialis anterior (13.5 pps), whereas the vastii muscles (11.5 pps), gastrocnemius (11.3 pps) and soleus (9.9 pps) show the lowest rates. Sex-stratified analysis shows females exhibit higher discharge rates than males (14.5 vs 11.9 pps; Cohen's d=0.38, p=0.018). In contrast, age-stratified analysis reveals non-significant differences between young and older adults (d=-0.24, p=0.072). Collectively, these results show that current views of human motor units are limited to a few muscles, with little data on females and older adults. The complete structured database is available through an open-access interactive platform (https://neuro-mechanix.com/metadata), enabling researchers to explore, filter, and download the extracted metadata. NeuromechaniX provides infrastructure for large-scale meta-research, identification of literature gaps, and hypothesis generation for the neuromechanics community.
Tireli, E. D.; Larsson, H. B. W.; Vestergaard, M. B.; Cramer, S. P.; Lindberg, U.; Tireli, D.
We present p-Brain, an end-to-end neuroimaging analysis framework for reproducible, automated quantitative DCE-MRI analysis at scale. From standard acquisitions, p-Brain estimates baseline relaxation parameters, converts signal to gadolinium concentration, derives arterial and venous input functions using convolutional neural network (CNN) slice selection and ROI segmentation, and produces voxelwise maps with regional and whole-brain summaries. The pipeline implements Patlak graphical analysis to estimate the blood-brain barrier influx constant (Ki) and plasma volume fraction (vp), and performs model-free residue deconvolution with Tikhonov regularisation to estimate cerebral blood flow (CBF), mean transit time (MTT), and capillary transit-time heterogeneity (CTH) from the same DCE dataset. p-Brain exports analysis-ready outputs, intermediate readouts, structured runtime metadata, and stage-level quality control artifacts to support auditability in batch processing. We evaluate the framework on a technically uniform set of 97 DCE-MRI scans from 58 healthy human participants, and show close agreement between automated Patlak Ki summaries and an established reference workflow. A companion macOS desktop application supports batch execution, job monitoring, and rapid review of curves and maps. p-Brain is open-source and configurable, enabling extension to additional kinetic models.
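The Patlak graphical analysis named in the p-Brain abstract above can be sketched in a few lines. The standard Patlak model assumes C_t(t) = Ki * integral of C_p from 0 to t + vp * C_p(t); dividing by C_p(t) gives a straight line whose slope is Ki and whose intercept is vp. The code below is a minimal illustration under that assumption, using trapezoidal integration and ordinary least squares. It is not p-Brain's implementation, and the function and argument names are hypothetical.

```python
def patlak_fit(times, plasma, tissue):
    """Estimate (Ki, vp) by Patlak linearisation of one concentration curve.

    times  -- sample times, strictly increasing
    plasma -- plasma (arterial input) concentration at each time
    tissue -- tissue concentration at each time
    Ki is returned per unit of `times`; vp is a dimensionless fraction.
    """
    if not (len(times) == len(plasma) == len(tissue)):
        raise ValueError("times, plasma and tissue must have the same length")

    # Running trapezoidal integral of the plasma curve.
    integral = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        integral.append(integral[-1] + 0.5 * dt * (plasma[i] + plasma[i - 1]))

    # Patlak coordinates: x = integral(Cp) / Cp, y = Ct / Cp.
    xs, ys = [], []
    for cp, ct, area in zip(plasma, tissue, integral):
        if cp > 0:  # skip samples where the ratio is undefined
            xs.append(area / cp)
            ys.append(ct / cp)
    if len(xs) < 2:
        raise ValueError("need at least two samples with positive plasma concentration")

    # Ordinary least-squares line: slope = Ki, intercept = vp.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("Patlak x-values are all identical; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ki = sxy / sxx
    vp = mean_y - ki * mean_x
    return ki, vp
```

In a voxelwise setting the same fit would be repeated for every voxel's tissue curve against a shared input function; real pipelines typically also restrict the fit to the later, linear portion of the curve.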
Stowell, D.; Nolasco, I.; McEwen, B.; Vidana Vila, E.; Jean-Labadye, L.; Benhamadi, Y.; Lostanlen, V.; Dubus, G.; Hoffman, B.; Linhart, P.; Morandi, I.; Cazau, D.; White, E.; White, P.; Miller, B.; Nguyen Hong Duc, P.; Schall, E.; Parcerisas, C.; Gros-Martial, A.; Moummad, I.
Computational bioacoustics has seen significant advances in recent decades. However, the rate of insights from automated analysis of bioacoustic audio lags behind our rate of collecting the data - due to key capacity constraints in data annotation and bioacoustic algorithm development. Gaps in analysis methodology persist: not because they are intractable, but because of resource limitations in the bioacoustics community. To bridge these gaps, we advocate the open science method of data challenges, structured as public contests. We conducted a bioacoustics data challenge named BioDCASE, within the format of an existing event (DCASE). In this work we report on the procedures needed to select and then conduct useful bioacoustics data challenges. We consider aspects of task design such as dataset curation, annotation, and evaluation metrics. We report the three tasks included in BioDCASE 2025 and the resulting progress made. Based on this we make recommendations for open community initiatives in computational bioacoustics.
Demsar, J.; Kraljic, A.; Matkovic, A.; Brege, S.; Pan, L.; Tamayo, Z.; Fonteneau, C.; Helmer, M.; Ji, J. L.; Anticevic, A.; Korponay, C.; Salavrakos, M.; Glasser, M. F.; Nickerson, L. D.; Cho, Y. T.; Repovs, G.
Preprocessing and analysis of neuroimaging data are technically demanding, often requiring a combination of multiple software tools, modality-specific pipelines, and extensive parameter tuning to match dataset characteristics. These complexities make it difficult to document workflows in sufficient detail to ensure complete transparency and reproducibility. To address these challenges, we introduce QuNex recipes, a framework for defining and executing complete neuroimaging workflows - encompassing data onboarding, preprocessing, and analysis - in a transparent, machine- and human-readable format. Recipes are implemented as an integrated feature of the Quantitative Neuroimaging Environment & Toolbox (QuNex), a containerized, open-source platform for end-to-end multimodal and multi-species neuroimaging processing. The recipes framework enables seamless integration of QuNex commands with custom scripts and external tools, capturing every processing step and parameter setting. A fully reproducible study can thus be shared and replicated by providing only (a) the QuNex version used, (b) the recipe file, and (c) the data. This approach standardizes workflow specification, enhances transparency, and enables one-command replication of complex neuroimaging analyses. By providing a standardized way to describe and share workflows, recipes facilitate open exchange of best practices and reproducible methods within the neuroimaging community.